Linked Open Data Visualization Revisited: A Survey
Abstract
syntax to make statements about resources and, thanks to LOD principles, construct links to available external datasets in a simple manner, making them accessible and queryable through the Internet and promoting data reusability.

As the amount of information at hand grows every day, handling it efficiently becomes a key requirement for anybody interested in working with it. This situation sets a favourable scene for the Information Visualization field (or InfoVis, as it is known in academia and industry), which takes advantage of humans' capacity to identify patterns and gain insights from visual representations of abstract data. InfoVis positions itself at the intersection of other data-related fields: Statistics, Analytics, Dissemination and so on.

One of the biggest issues concerning mass adoption of LOD outside the Semantic Web (SW) community is the technical and conceptual knowledge required to take full advantage of the benefits provided by this type of data publishing. With expressive visual representations, most users can employ their visual capacities to obtain a clear understanding of the data stored within a dataset. Interactive visualizations also offer the possibility to play and experiment with the data, allowing users to perform exploratory knowledge discovery using the “follow your nose” principle [53].

This article is structured as follows: Section 2 provides background knowledge on information visualization, addressing the best practices for representing abstract data visually so that they can be interpreted coherently. Section 3 describes current approaches to LOD visualization, which are then evaluated in Section 4 against the previously defined features in order to solve LOD visualization issues. Finally, Section 5 discusses the findings of this study and the conclusions drawn from it.

1 http://5stardata.info/

2. Background

As progress stands on the shoulders of giants, it is important to compile existing research in this field in order to apply it to LOD scenarios. In accordance with Information Theory, vision is the sense with the largest bandwidth to send information to the brain [52], and humans' ability to quickly understand complex data through it is reflected in the well-known adage “a picture is worth a thousand words”. Promoted by John Tukey, Exploratory Data Analysis (EDA) [49] tries to summarize the main features of a dataset by applying visual methods. This makes the EDA approach a perfect candidate to be followed in LOD visualization.

As more and more governments, public entities, organisations, etc. are encouraged every day (and sometimes forced by transparency policies) to make public data accessible to citizens and interested third parties, automating the publication of information is a common approach among practitioners to easily expose huge amounts of files and documents to public consumption. The errors caused by automatic parsing and processing, together with incorrectly applied term disambiguation and the approach selected to deal with missing values, give rise to LOD that needs a lot of pre-processing before it is usable for a data analysis task. Likewise, the diversity of topics that those datasets deal with makes the automatic visualization of LOD a great challenge full of research opportunities.

Tables have been widely used to display LOD. When consulting information about a resource (object), a table is generated with as many rows as attribute instances: a first column with the property name (or IRI), and a second column with the value. The table layout has been popularised by tools similar to Pubby [24], which is in charge of generating the greenish HTML pages of DBpedia articles.

Regarding topic diversity, domain-specific tools such as FoaF Explorer [13], map4rdf [34], LinkedGeoData browser [48], etc.
display a well-selected set of visual representations, as a result of being tailored to a concrete set of ontologies within a well-known environment. As diversity increases, more vocabularies are designed to reflect the details of a great number of subjects, so multi-domain or generalist approaches need to be designed in a manner that lets them manage and generate visualizations over different scenarios.

O. Peña, U. Aguilera & D. López-de-Ipiña / Linked Open Data Visualization Revisited: A Survey

2.1. Datatype analysis

Ben Shneiderman proposed seven basic datatypes [46] into which a data fragment can be classified, establishing a taxonomy that allows a datum to be tagged with a certain category, which in turn determines how it can be used and which operators are applicable. Following this taxonomy, an extended description of each datatype and its connection to LOD principles is detailed.

– 1 dimensional: linear datatypes including textual documents, program source code and alphabetical lists of names, which are all organised in a sequential manner. Unidimensional data is usually displayed as lists of items organised by a single feature (e.g., alphabetical order), so it is uncommon to see it visualised. A special case arises when a data dimension has a narrow range of values repeated through the dataset, for example, the names of months or the department titles of an office. These values are known in statistics as categorical data, or factors. A simple aggregation of these values can be used to create distribution analyses in a further step.

– 2 dimensional (planar): planar or map data including geographic maps, floorplans or newspaper layouts. Geographical features offer an excellent opportunity to help users locate data instances on a map. In combination with map templating engines, data instances can be placemarked using different symbols, thus allowing resources to be distinguished according to their class.
The ability to pinpoint elements on a map may help uncover element distribution patterns in the datasets, letting users identify the areas where resources are either tightly gathered or dispersed. Advanced projection techniques can also enhance presentation by clustering elements together according to the applied zoom level, or even by highlighting high-interest areas using heatmaps.

Planar data can also be found as an array of bidimensional features, producing geometrical shapes which delimit an area within a map. These bounding boxes, when overlaid on a base map, provide great insight into data which affect a wider geographical area, not just a unique, precise point.

Finally, bidimensional data does not only produce visual representations on its own; in aggregation with other data dimensions it can create augmented visualizations of greater value. By adding labels, descriptions, images, etc., the map layout is enriched, making geospatial data fun for any user to interact with. Furthermore, data instances can be collected by geographical areas (such as countries, states, etc.) and normalised, with each region encoded within a pre-established colour palette to produce choropleth maps, or established borders can be distorted to proportionally expose local contrasts over a set of variables using cartograms.

– 3 dimensional (volumetric): real-world objects such as molecules, the human body, and buildings, having items with volume and some potentially complex relationship with other items. Whereas this datatype is one of the pillars of scientific visualization, untrained eyes may find it difficult to correctly interpret what 3D graphs and charts are trying to represent. Traditionally related to huge datasets, this datatype adds complexity for untrained users, requiring well-developed spatial vision in order to understand the underlying data.
Besides, pleasant rendering of both big data sources and 3D images on web browsers still poses a challenge, but server-side preprocessing techniques together with WebGL’s features [1] should overcome the technical issues in the near future.

– Multi-dimensional: items from relational and statistical databases with n attributes becoming points in an n-dimensional space. The easiest manner in which n-dimensional data can be defined is by taking an object and providing values for each of its n attributes (with n > 1). The descriptions of m objects using those n features give rise to an m × n matrix, each row representing an object instance and each column collecting all the measurements for a given dimension. Due to its suitability to fit abstract models, mapping multi-dimensional data to relational database schemas, spreadsheets or CSV files is quite trivial, and so is expressing these data as triples in which an instance of an ontology class is the subject, each attribute is a predicate and the measured value is the object.

The number of dimensions can give clues about which visual representations are more appropriate in each case. As an example, a first choice to exhibit a dataset by two of its dimensions would be to draw a scatter plot, with each dimension represented over an axis and each dot placed where both values meet for a given instance. If a third dimension is added to the analysis, encoding it in each dot’s area turns the scatter plot into a bubble chart. More dimensions can be encoded through colours, shapes, etc. As important as dimensionality, the datatypes of each dimension can narrow the universe of visualizations down to the most suitable ones in each case, as expounded in [29].
Advanced visualizations can be developed by bringing the features of other datatypes together in a multi-dimensional space: time-based cyclical data in polar charts, planar combined with timestamped data in complex timelines, etc.

– Temporal: distinct from 1-dimensional data in that items have a start and finish time, which covers not only timestamped data (i.e., a precise moment in time) but also items spanning through time with a defined start and end date (overlaps are allowed). Time-based data is very useful when arranging elements through history in chronological order, for example in medical records, project management or historical presentations. Additionally, temporal data can have a recurrent regularity (e.g., weekly, monthly, every four years, etc.). All these components make temporal data suitable to be displayed in calendars and timelines (either in combination with geographical features or on its own).

Nevertheless, as with planar data, time-series data is a perfect candidate to be mixed with new data dimensions, allowing new analyses over data that changes over time. Multiple domains such as finance, science, public policy and management (to name a few) take advantage of temporal data to detect patterns and trends in their datasets. Time series forecasting can also be used to predict future values based on the recorded measures in our datasets. Together with multi-dimensional data, temporal information can be represented with the most diverse variety of visualization techniques, relying on the temporal dimension as a principal component of the chart.

– Tree (hierarchical): collections of items, each having a link to a parent object (except the root), forming hierarchies or tree-like structures. Hierarchies or tree structures are formed by items having links to other instances as parents, siblings and children, in a resemblance to a family tree.
These structures have in common a root node, from which the rest of the instances grow in depth until end nodes (items with no children), also known as leaves, are reached. Trees provide a great understanding of the overall structure of the data being studied, where analysts are able to perform the first two tasks of Ben Shneiderman’s visualization mantra [46]: “overview” and “zoom”, gaining an overview of the whole structure and then zooming in on the items of interest. Common operations performed over trees include counting the total number of items (e.g., total number of classes in the DBpedia ontology [3]), the number of children of a selected node (e.g., child classes of dbo:Agent) and the number of elements defined within a node (e.g., instances of dbo:University).

The indented tree visual representation has traditionally been used to navigate through file directories in operating systems, or to render the structure of software packages in programming suites. The possibility to collapse a subtree makes this approach very useful for reaching deep nodes within the structure with minimal visual overload and efficient interactive exploration. Adjacency diagrams are a space-filling variant of the previous representations, where the position of a node relative to adjacent items reveals its place in the hierarchy. Icicle (partition) layouts are similar to dendrograms, with the advantage of providing an additional dimension (area) to display another variable. Sunbursts are a polar-coordinate variant of icicle layouts. By substituting containment for adjacency, the treemap concept was introduced [30], displaying structure as a set of nested geometries in a tile layout. Whilst the most widespread geometry used in treemaps is the rectangle, other shapes can also be used, generating Voronoi, Jigsaw or Circular treemaps.

– Network: cases emerge where hierarchical structures are not enough to capture the essence of the relationships among items in a dataset, especially within links among LOD sources.
Nodes have no linkage constraints and are free to connect to any other item. This freedom allows similar resources to be combined across a diverse set of features. Both external and internal links create a graph of interlinked items, which needs a layout algorithm in order to be displayed due to the lack of hierarchical meaning. The search for an efficient layout that honestly represents the data depends on the message analysts want to highlight. Sometimes analysts will be looking for the shortest paths between two items, or for how many cliques [38] the community is divided into. Using techniques from the Social Network Analysis (SNA) field, the key players of the network can also be identified, with a wide set of metrics available to determine relevance [37].

The least known network representation is usually the adjacency matrix, a tool often used by mathematicians and computer scientists to relate items in a 2D space. Each cell encodes the value between the column and row data instances (either showing the number or following a colour palette), and matrix re-arrangement allows clusters and bridges to be detected quickly. Besides, no collisions between links can happen, at the expense of requiring a bigger area to display all the information. Filters and selectors can help reduce the matrix’s width and height.

Easier to interpret are node-link diagrams in a graph layout, where nodes represent each data instance, and the edges or links between them represent the attribute through which items are connected. Depending on the algorithm used to display the graph, different attributes will be highlighted in the analysis. For example, the force-directed layout treats nodes as particles of a physical system, each one repelling the others, with only those that share links being pulled together.
Edge weight can be used as a gravity indicator, so that the stronger the link, the closer the particles will stay together, whereas the weaker the link, the further apart the nodes will be placed. Bigger graphs will populate the visualization with nodes and links, creating giant hairballs with multiple line crossings. Although there are researchers trying to minimise the hairball effect [32], high-density networks are usually not suitable for graph rendering.

The Linked Data Visualization Model (LDVM) [22] mapped these datatypes to visualization tools and RDF vocabularies, creating the mappings summarised in Table 1. The Linked Data Visualization Wizard (LDVizWiz) [16] also uses this datatype categorisation in order to deal with the semi-automatic generation of visual representations based on LOD.
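
As a minimal sketch of the adjacency matrix discussed under the Network datatype, the following Python fragment builds such a matrix from a handful of resource-to-resource links. The triples, the `ex:` identifiers and the helper names `adjacency_matrix` and `render` are illustrative assumptions and do not come from any of the surveyed tools; links are treated as undirected, and each cell counts the links between the row and column resources.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical (subject, predicate, object) links, invented for illustration.
TRIPLES: List[Triple] = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:dave", "foaf:knows", "ex:erin"),
]


def adjacency_matrix(triples: List[Triple]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted node list and a symmetric matrix of link counts."""
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for subject, _, obj in triples:
        i, j = index[subject], index[obj]
        matrix[i][j] += 1
        if i != j:
            # Assumption: links are undirected, so mirror the cell.
            matrix[j][i] += 1
    return nodes, matrix


def render(nodes: List[str], matrix: List[List[int]]) -> str:
    """Draw the matrix as text, one row per resource, '.' for empty cells."""
    width = max((len(node) for node in nodes), default=0)
    lines = []
    for name, row in zip(nodes, matrix):
        cells = " ".join(str(value) if value else "." for value in row)
        lines.append(f"{name.ljust(width)}  {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    node_names, link_counts = adjacency_matrix(TRIPLES)
    print(render(node_names, link_counts))
```

Changing the order of `nodes` before the matrix is filled corresponds to the re-arrangement step: when connected resources are placed next to each other, clusters show up as dense blocks along the diagonal.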
Publication date: 2014